Abstract. We compare an entropy estimator $\hat{H}_z$ recently discussed in [10] with
two estimators $\hat{H}_1$ and $\hat{H}_2$ introduced in [6], [7]. We prove the identity
$\hat{H}_z = \hat{H}_1$, which has not been taken into account in [10]. Then, we prove that
the statistical bias of $\hat{H}_1$ is less than the bias of the ordinary likelihood estimator
of entropy. Finally, by numerical simulation we verify that for the most interesting regime of
small sample estimation and large event spaces, the estimator $\hat{H}_2$ has a significantly
smaller statistical error than $\hat{H}_z$.

1. Introduction

Symbolic sequences are typically characterized by an alphabet A of d different letters.
We assume statistical stationarity, i.e. any letter-block (word or n-gram of constant
length) $w_i$, $i = 1, \ldots, M$, can be expected at any chosen site to occur with a known
probability $p_i = \mathrm{prob}(w_i)$ and $\sum_{i=1}^{M} p_i = 1$.

In a classic paper published in 1951, Shannon considered the problem of
estimating the entropy

\[ H = -\sum_{i=1}^{M} p_i \log p_i \tag{1} \]

of ordinary English [1]. In principle, this might be done by dealing with longer and
longer contexts until dependencies at the word level, phrase level, sentence level,
paragraph level, chapter level, and so on, have all been taken into account in the
statistical analysis. In practice, however, this is quite impractical, for as the context
grows, the number M of possible words explodes exponentially with n.

In the numerical estimation of the Shannon entropy one can do frequency
counting, since in the limit of large sample size N the relative frequency distribution
yields an estimate of the underlying probability distribution. We consider samples of
N independent observations, and let $k_i$, $i = 1, \ldots, M$, be the frequency of
realization $w_i$ in the ensemble. However, with the choice $\hat{p}_i = k_i/N$, the naive
(or likelihood) estimate

\[ \hat{H}_0 = -\sum_{i=1}^{M} \hat{p}_i \log \hat{p}_i \tag{2} \]

leads to a systematic underestimation of the Shannon entropy [2] [3] [4] [5] [6] [7]. In
particular, if M is of the order of the number of data points N, then fluctuations
increase and estimates usually become significantly biased. By bias we denote the
deviation of the expectation value of an estimator from the true value. In general,
the problem in estimating functions of probability distributions is to construct an
estimator whose estimates both fluctuate with the smallest possible variance and are
least biased.
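The systematic underestimation can be made concrete with a small sketch. It assumes the plug-in form $\hat{p}_i = k_i/N$ and natural logarithms; the function names and the sizes M = 3, N = 4 are illustrative choices, not taken from the paper. For such a tiny uniform source, the expectation of the naive estimate can be computed exactly by enumerating all samples.

```python
import itertools
import math
from collections import Counter


def naive_entropy(counts):
    """Plug-in (likelihood) estimate in nats: -sum p_hat * log(p_hat), p_hat = k/N."""
    n = sum(counts)
    return -sum((k / n) * math.log(k / n) for k in counts if k > 0)


def expected_naive_entropy_uniform(m, n):
    """Exact mean of the plug-in estimate for a uniform source with m words,
    found by enumerating all m**n equally likely samples (small m and n only)."""
    total = 0.0
    for sample in itertools.product(range(m), repeat=n):
        total += naive_entropy(Counter(sample).values())
    return total / m ** n


M_WORDS, N_OBS = 3, 4          # illustrative sizes, not taken from the paper
true_h = math.log(M_WORDS)
mean_h0 = expected_naive_entropy_uniform(M_WORDS, N_OBS)
bias = mean_h0 - true_h        # negative: the estimate is too small on average
print(f"H = {true_h:.4f}, E[H0] = {mean_h0:.4f}, bias = {bias:.4f}")
```

The enumeration is only feasible for very small M and N; it is used here because it gives the bias without any sampling noise.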

On the other hand, there is the Bayesian approach to entropy estimation, building
upon an approach introduced in [8], or a generalization recently proposed in [9].
There, the basic strategy is to place a prior over the space of probability distributions
and then perform inference using the induced posterior distribution over entropy.
Actually, a partial numerical comparison of the popular Bayesian entropy estimates
and those discussed hereinafter can be found in [9]. Unfortunately, these simulations
only consider the bias of the entropy estimates but not their mean square error, which
takes into account the important trade-off between bias and variance. However, in the
considerations to be discussed below, no explicit prior information on distributions is
assumed, and we will focus on non-Bayesian entropy estimates only.

To start with, let us consider an estimator of the Shannon entropy which has
recently been proposed and analyzed against the likelihood estimator [10]. The
development of this interesting estimator starts with a generalization of the diversity
index proposed by Simpson in 1949 [11] and refers to the following representation of
the Shannon entropy‡

\[ H = \sum_{v=1}^{\infty} \frac{1}{v} \sum_{i=1}^{M} p_i (1 - p_i)^v. \tag{3} \]

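In [10], it has been mentioned that there exists an interesting estimator of each term in
(3), which is unbiased up to the order $v = N - 1$, namely $Z_v/v$, where $Z_v$ is explicitly
given by the expression

\[ Z_v = \frac{N^{1+v}\,(N - v - 1)!}{N!} \sum_{i=1}^{M} \hat{p}_i \prod_{j=0}^{v-1} \left( 1 - \hat{p}_i - \frac{j}{N} \right), \tag{4} \]

such that

\[ \hat{H}_z = \sum_{v=1}^{N-1} \frac{1}{v} Z_v \tag{5} \]

is a statistically consistent entropy estimator of $H$ with (negative) bias

\[ B_N = -\sum_{v=N}^{\infty} \frac{1}{v} \sum_{i=1}^{M} p_i (1 - p_i)^v. \tag{6} \]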
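As an illustration, Eqs. (4) and (5) can be evaluated literally. The sketch below assumes $\hat{p}_i = k_i/N$ and uses exact rational arithmetic to avoid overflow in the factorial prefactor; the function name `zhang_entropy` is a hypothetical label, and the triple loop is a direct transcription rather than an efficient implementation.

```python
import math
from fractions import Fraction


def zhang_entropy(counts):
    """Evaluate H_z = sum_{v=1}^{N-1} Z_v / v, with Z_v as in Eq. (4), in nats."""
    n = sum(counts)
    h = Fraction(0)
    for v in range(1, n):
        # Prefactor N^(1+v) * (N-v-1)! / N! of Eq. (4), kept exact.
        prefactor = Fraction(n ** (1 + v) * math.factorial(n - v - 1),
                             math.factorial(n))
        z_v = Fraction(0)
        for k in counts:
            p_hat = Fraction(k, n)
            product = Fraction(1)
            for j in range(v):
                product *= 1 - p_hat - Fraction(j, n)
            z_v += p_hat * product
        h += prefactor * z_v / v
    return float(h)


# Hypothetical counts: N = 3 observations, one word seen twice, one seen once.
print(zhang_entropy([2, 1]))
```

The cost grows roughly like M times N squared, which is relevant to the remark on computation time made later in the text.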


Indeed, the estimator is notable because a uniform variance upper bound has been
proven in [10] that decays at a rate of $O(\log(N)/N)$ for all distributions with finite
entropy, compared to $O((\log N)^2/N)$ of the ordinary likelihood estimator established
in [13]. It should be mentioned here that the latter decay rate is an implication of the
Efron-Stein inequality, whereas the former (faster) decay rate is derived within the
completely different approach introduced in [10]. Actually, it seems hard to prove the


‡ For another interpretation of this representation see [12].

In the following section, we will show that $\hat{H}_z$ is algebraically equivalent to the
estimator [6], [7]

\[ \hat{H}_1 = \sum_{i=1}^{M} \frac{k_i}{N} \left( \psi(N) - \psi(k_i) \right), \tag{7} \]

where the summation is defined for all $k_i > 0$ and the digamma function $\psi(k)$ is the
logarithmic derivative of the Gamma function [14]. Actually, the estimator (7) is
given for the choice $\xi = 1$ in [7] (Eq. (28) therein). In the asymptotic regime $k_i \gg 1$
this estimator leads to the ordinary Miller correction $\hat{H}_1 \approx \hat{H}_0 + (M - 1)/(2N)$. This
can be seen by using the asymptotic relation $\psi(x) \approx \log(x) - 1/(2x)$.
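A minimal sketch of the estimator (7) and of the Miller limit follows. It relies on the fact that, for integer arguments, $\psi(N) - \psi(k) = \sum_{j=k}^{N-1} 1/j$, so no special-function library is needed. The function names and the example counts are hypothetical.

```python
import math


def digamma_difference(n, k):
    """psi(n) - psi(k) for integers n >= k >= 1, i.e. sum_{j=k}^{n-1} 1/j."""
    return sum(1.0 / j for j in range(k, n))


def h1_entropy(counts):
    """H_1 = sum_i (k_i/N) * (psi(N) - psi(k_i)), summed over k_i > 0 (nats)."""
    n = sum(counts)
    return sum((k / n) * digamma_difference(n, k) for k in counts if k > 0)


def naive_entropy(counts):
    """Plug-in estimate -sum p_hat * log(p_hat) with p_hat = k/N (nats)."""
    n = sum(counts)
    return -sum((k / n) * math.log(k / n) for k in counts if k > 0)


counts = [400, 300, 300]                 # hypothetical well-sampled counts
m_observed, n_total = len(counts), sum(counts)
miller = naive_entropy(counts) + (m_observed - 1) / (2 * n_total)
gap = h1_entropy(counts) - miller        # small when all k_i >> 1
print(f"H1 = {h1_entropy(counts):.6f}, Miller = {miller:.6f}, gap = {gap:.2e}")
```

Here M is taken as the number of words actually observed, which is the reading under which the Miller correction follows from the asymptotic relation.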


The mathematical expression of the bias of $\hat{H}_1$ has also been derived in [7] and is
explicitly given by

\[ B_N^{(1)} = -\sum_{i=1}^{M} p_i \int_0^{1-p_i} \frac{t^{N-1}}{1-t}\, dt, \tag{8} \]

with a uniform upper bound

\[ |B_N^{(1)}| < \frac{M}{N}. \tag{9} \]


The proof of the identity $B_N = B_N^{(1)}$ will be suppressed here because it is sufficient to
show the equivalence of the corresponding entropy estimators in the following section.
It should be mentioned that the numerical computation of the estimator $\hat{H}_1$ is
significantly faster than that of $\hat{H}_z$. Actually, this improvement has not been taken into
account in reference [9] (Fig. II), where the authors still used expression (5) above.


In the third section, by numerical computation we compare the mean square error
of $\hat{H}_z$ with an entropy estimator corresponding to $\xi = 1/2$ in Eq. (13) of [7] (see also
Eq. (35) of [6]), which is explicitly given by the following representation

\[ \hat{H}_2 = \sum_{i=1}^{M} \frac{k_i}{N} \left( \psi(N) - \psi(k_i) + \log(2) + \sum_{j=1}^{k_i-1} \frac{(-1)^j}{j} \right). \tag{10} \]


This estimator is an extension of (7) by an oscillating term in the bracket on the right-
hand side of (10). In both [6] and [7], this estimator has not been expressed in terms
of a finite sum, but by integral expressions or infinite sum representations instead.
However, it can be easily shown that the present form is equivalent to those in [6], [7],
but the computation is less time-consuming. The bias of the estimator (10) is [7]

\[ B_N^{(2)} = -\sum_{i=1}^{M} p_i \int_0^{1-2p_i} \frac{t^{N-1}}{1-t}\, dt, \tag{11} \]

with uniform upper bound

\[ |B_N^{(2)}| < \frac{M+1}{2N}. \tag{12} \]


Now, when we look at the right-hand sides of (9) and (12), we see that they mainly
differ by a factor 2 in the denominator. This implies that the upper bound (12) is
smaller than the upper bound (9) for all N and M > 2. Thus, we can expect a faster
convergence of $\hat{H}_2$ for sufficiently large M and not very strongly peaked probability
distributions. Actually, these are the distributions we are mainly interested in. The
numerical comparison of the mean square error of $\hat{H}_z$ and $\hat{H}_2$ will be evaluated for
the uniform probability distribution, the Zipf distribution and for the zero-entropy
delta distribution.
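To get a feeling for the two bias terms, one can evaluate them numerically for a uniform distribution. The sketch below rests on several assumptions: the bias of $\hat{H}_1$ is taken in its series form (6), the bias of $\hat{H}_2$ is taken as the same series with $1 - p_i$ replaced by $1 - 2p_i$ (the series expansion of an integral with upper limit $1 - 2p_i$, valid for $p_i < 1$), and the bounds are read as $M/N$ and $(M+1)/(2N)$. The parameter `scale` and the truncation length `terms` are hypothetical helpers.

```python
def bias_series(probs, n, scale, terms=5000):
    """Truncated series -sum_{v>=n} (1/v) * sum_i p_i * (1 - scale*p_i)**v.

    scale=1 corresponds to the bias of H_1; scale=2 is the assumed
    series form of the bias of H_2 (requires all p_i < 1).
    """
    total = 0.0
    for v in range(n, n + terms):
        total += sum(p * (1.0 - scale * p) ** v for p in probs) / v
    return -total


M_WORDS, N_OBS = 100, 50                  # illustrative undersampled setting
uniform = [1.0 / M_WORDS] * M_WORDS
b1 = bias_series(uniform, N_OBS, scale=1)
b2 = bias_series(uniform, N_OBS, scale=2)
bound1 = M_WORDS / N_OBS
bound2 = (M_WORDS + 1) / (2 * N_OBS)
print(f"B1 = {b1:.4f} (bound {bound1}), B2 = {b2:.4f} (bound {bound2})")
```

For this setting both biases are negative and the second is smaller in magnitude, in line with the expectation stated above for large M and flat distributions.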

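2. Comparison of $\hat{H}_z$ and $\hat{H}_1$

In this section, we show the identity $\hat{H}_z = \hat{H}_1$. To this end, let $Z_{v,i}$ denote the i-th
term of (4),

\[ Z_{v,i} = \frac{N^{1+v}\,(N - v - 1)!}{N!}\, \frac{k_i}{N} \prod_{j=0}^{v-1} \left( 1 - \frac{k_i}{N} - \frac{j}{N} \right). \tag{13} \]

By extending with N in the product, this expression can be rewritten as

\[ Z_{v,i} = \frac{(N - v - 1)!}{N!}\, k_i \prod_{j=0}^{v-1} (N - k_i - j). \tag{14} \]

Next, the product is reformulated as a quotient of factorials, i.e.

\[ \prod_{j=0}^{v-1} (N - k_i - j) = \frac{(N - k_i)!}{(N - k_i - v)!}, \tag{15} \]

and in terms of binomial coefficients we get

\[ Z_{v,i} = \frac{k_i}{N - v} \binom{N}{k_i}^{-1} \binom{N - v}{k_i}. \tag{16} \]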


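Now, the i-th term of the estimator (5) is obtained by summation over $v$, i.e.

\[ \sum_{v=1}^{N-1} \frac{1}{v} Z_{v,i} = k_i \binom{N}{k_i}^{-1} \sum_{v=1}^{N-1} \frac{1}{v(N - v)} \binom{N - v}{k_i} = \frac{k_i}{N} \left( \mathcal{H}_{N-1} - \mathcal{H}_{k_i-1} \right), \tag{17} \]

where $\mathcal{H}_k$ denotes the k-th harmonic number [U]. Applying the identity
$\mathcal{H}_{k-1} = \psi(k) + \gamma$ (with $\gamma = 0.5772\ldots$, the Euler-Mascheroni constant) and
summation for $i = 1, 2, \ldots, M$, we obtain the estimator (7), which proves the identity
$\hat{H}_z = \hat{H}_1$.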
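The term-wise identity can be checked with exact rational arithmetic. The sketch below assumes the binomial form $Z_{v,i} = \frac{k_i}{N-v}\binom{N}{k_i}^{-1}\binom{N-v}{k_i}$ and compares the sum over $v$ with $(k_i/N)(\mathcal{H}_{N-1} - \mathcal{H}_{k_i-1})$ for all $1 \le k \le N \le 30$; the function names and the range are illustrative. It is a numerical sanity check, not a substitute for the proof.

```python
from fractions import Fraction
from math import comb


def zhang_term(n, k):
    """sum_{v=1}^{n-1} Z_{v,i} / v, with Z_{v,i} in its binomial form."""
    return sum(Fraction(k * comb(n - v, k), v * (n - v) * comb(n, k))
               for v in range(1, n))


def harmonic(m):
    """m-th harmonic number as an exact fraction (0 for m = 0)."""
    return sum(Fraction(1, j) for j in range(1, m + 1))


def harmonic_term(n, k):
    """(k/n) * (H_{n-1} - H_{k-1}), the closed form of the summed term."""
    return Fraction(k, n) * (harmonic(n - 1) - harmonic(k - 1))


identity_holds = all(zhang_term(n, k) == harmonic_term(n, k)
                     for n in range(1, 31) for k in range(1, n + 1))
print(identity_holds)
```

`math.comb` returns 0 when the lower index exceeds the upper one, which matches the vanishing of the product for $v > N - k_i$.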

In addition, we have the following

Proposition. The estimator $\hat{H}_1$ is less biased than the likelihood estimator $\hat{H}_0$.


Proof. Since we know from [7] that the bias of $\hat{H}_1$ is negative, it is sufficient to
prove that $\psi(N) - \psi(k) > \log(N/k)$ for $0 < k < N$, because then no term of $\hat{H}_1$ is
smaller than the corresponding term of $\hat{H}_0$. The following inequalities [U]

\[ \psi(N) > \log\left( N - \tfrac{1}{2} \right), \tag{18} \]

\[ \psi(k) < \log(k) - \frac{1}{2k} \tag{19} \]

can be applied such that we only have to check that

\[ N > \frac{1/2}{1 - e^{-1/(2k)}}. \tag{20} \]

Now, for any finite $k > 0$, the inequality $1 + \frac{1}{2k} < \exp\left(\frac{1}{2k}\right)$ is satisfied. The proof is
by Taylor series expansion of the exponential function. From this, by simple algebraic
manipulations, it follows that the right-hand side of (20) is less than $k + \frac{1}{2}$, for any
finite $k > 0$. It follows that (20) is satisfied for any integer $k$ with $0 < k < N$. This proves
that $\hat{H}_1$ is less biased than $\hat{H}_0$. □
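The two inequalities used in the proof can be spot-checked numerically. The sketch below assumes the reading $\psi(N) - \psi(k) > \log(N/k)$ for integers $0 < k < N$ and the bound $k + 1/2$ on the right-hand side of (20); the ranges (N up to 200, k up to 999) and the names are arbitrary illustrative choices.

```python
import math


def digamma_difference(n, k):
    """psi(n) - psi(k) for integers n >= k >= 1, i.e. sum_{j=k}^{n-1} 1/j."""
    return sum(1.0 / j for j in range(k, n))


def smallest_margin(n_max):
    """Minimum of psi(N) - psi(k) - log(N/k) over integers 0 < k < N <= n_max."""
    return min(digamma_difference(n, k) - math.log(n / k)
               for n in range(2, n_max + 1) for k in range(1, n))


margin = smallest_margin(200)

# Right-hand side of (20), written with expm1 for accuracy: 1 - exp(-x) = -expm1(-x).
rhs_below_k_plus_half = all(
    0.5 / -math.expm1(-1.0 / (2 * k)) < k + 0.5 for k in range(1, 1000))

print(margin > 0, rhs_below_k_plus_half)
```

A positive margin over the tested range is consistent with the proposition, though only the analytic argument covers all N.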

3. Numerical comparison of $\hat{H}_z$ and $\hat{H}_2$

In this section, we will focus on the convergence rates of the root mean square error
(RMSE) of $\hat{H}_z$ and $\hat{H}_2$. Here, the RMSE is defined by

\[ \mathrm{RMSE} = \sqrt{E[(\hat{H} - H)^2]}. \tag{21} \]

We choose this error measure because it takes into account the trade-off between bias
and variance. Moreover, we want to mention that there is a slightly modified version
$\hat{H}_z^*$ of the estimator $\hat{H}_z$, defined in Eq. (12) of [10]. Since the bias of $\hat{H}_z$ is
explicitly known, a correction is defined by subtraction of the bias term $B_N$ with $p_i$ replaced
by its estimate $\hat{p}_i$. The modified estimator is then given by $\hat{H}_z^* = \hat{H}_z - \hat{B}_N$, where
$\hat{B}_N$ is the plug-in estimator of $B_N$. For reasons of simplicity, we refrain from applying
the same procedure of bias correction to the estimator $\hat{H}_2$.

Our first data sample is taken from the uniform probability distribution $p_i = 1/M$,
for $i = 1, 2, \ldots, M$. In addition, we consider the (right-tailed) Zipf distribution with
$p_i = c/i$, for $i = 1, 2, \ldots, M$, and normalization constant $c = 1/\mathcal{H}_M$ (reciprocal of
the M-th harmonic number). The statistical error for increasing sample size N and
given M is shown in Fig. 1 and Fig. 2. As we can see, the RMSE of all estimators is
monotonically decreasing in N. The convergence of the naive estimator $\hat{H}_0$ is rather slow
compared to the other estimators, while the performance of $\hat{H}_z^*$ is slightly better than
that of $\hat{H}_z$. On the other hand, the statistical error of $\hat{H}_2$ is significantly smaller than the
statistical error of $\hat{H}_z$ and $\hat{H}_z^*$, and this behaviour seems to be representative for large
M.
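A small Monte Carlo experiment along these lines can be sketched as follows. It is not the simulation behind the figures: the sizes M = 100, N = 50, the 200 trials and the seed are arbitrary, $\hat{H}_z$ is represented by the digamma form $\hat{H}_1$, the corrected estimator $\hat{H}_z^*$ is omitted, and $\hat{H}_2$ is taken as $\hat{H}_1$ plus the term $\log(2) + \sum_{j=1}^{k_i-1} (-1)^j/j$ in each bracket. All three estimators are evaluated on the same samples.

```python
import math
import random
from collections import Counter


def digamma_difference(n, k):
    """psi(n) - psi(k) for integers n >= k >= 1, i.e. sum_{j=k}^{n-1} 1/j."""
    return sum(1.0 / j for j in range(k, n))


def h0(counts):
    """Naive plug-in estimate (all counts are assumed positive)."""
    n = sum(counts)
    return -sum((k / n) * math.log(k / n) for k in counts)


def h1(counts):
    """Digamma form, identical to H_z according to Section 2."""
    n = sum(counts)
    return sum((k / n) * digamma_difference(n, k) for k in counts)


def h2(counts):
    """H_1 with the additional oscillating term in each bracket."""
    n = sum(counts)
    total = 0.0
    for k in counts:
        oscillating = math.log(2) + sum((-1) ** j / j for j in range(1, k))
        total += (k / n) * (digamma_difference(n, k) + oscillating)
    return total


def simulate_rmse(m, n, trials, seed=0):
    """RMSE of each estimator for a uniform source with m words and n draws."""
    rng = random.Random(seed)
    true_h = math.log(m)
    estimators = {"H0": h0, "H1": h1, "H2": h2}
    squared_error = dict.fromkeys(estimators, 0.0)
    for _ in range(trials):
        counts = list(Counter(rng.randrange(m) for _ in range(n)).values())
        for name, estimator in estimators.items():
            squared_error[name] += (estimator(counts) - true_h) ** 2
    return {name: math.sqrt(s / trials) for name, s in squared_error.items()}


rmse = simulate_rmse(m=100, n=50, trials=200)
print(rmse)
```

In this undersampled setting the ordering of the three errors should mirror the qualitative picture described above; the absolute numbers depend on the chosen sizes and seed.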

The statistical error for increasing M and fixed sample size N is shown in Fig. 3
and Fig. 4. For $M \gg N$, the RMSE of $\hat{H}_z$ and $\hat{H}_z^*$ is greater than that of $\hat{H}_2$. This
phenomenon reflects the fact that the bias reduction becomes more and more relevant
for increasing M, compared to the contribution of the variance.

As we can see from both examples, the gap between $\hat{H}_z^*$ and $\hat{H}_2$ is slightly smaller
for the peaked Zipf distribution compared to the uniform distribution. Thus, we ask
for the performance in the extreme case of the delta distribution $p_i = \delta_{i,1}$, which has
entropy zero. Indeed, in this special case we have $\hat{H}_0 = \hat{H}_1 = \hat{H}_z = \hat{H}_z^* = 0$ for any
sample size N, but $\hat{H}_2 = \log(2) + \sum_{j=1}^{N-1} (-1)^j/j \to 0$ for $N \to \infty$. Actually, in this
case the statistical error of the latter scales like $\sim 1/(2N)$ for large N.
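The delta-distribution case is easy to reproduce. The sketch below assumes that all N observations fall on a single word, so that the digamma difference in $\hat{H}_2$ vanishes and only $\log(2)$ plus the alternating sum remains; the function name is hypothetical.

```python
import math


def h2_delta(n):
    """H_2 when all n observations fall on one word (k_1 = n)."""
    # psi(N) - psi(k_1) vanishes, leaving log(2) plus the alternating sum.
    return math.log(2) + sum((-1) ** j / j for j in range(1, n))


for n in (10, 100, 1000):
    # Compare the magnitude of the estimate with the predicted scale 1/(2N).
    print(n, abs(h2_delta(n)), 1 / (2 * n))
```

The sign of the estimate alternates with N, which is the oscillation mentioned earlier; only its magnitude follows the $1/(2N)$ scale.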

4. Summary

In the present note, we classified the entropy estimator $\hat{H}_z$ of [10] within the
family of entropy estimators originally introduced in [7]. This reveals an interesting
connection between two different approaches to entropy estimation, one coming from
the generalization of the diversity index of Simpson and the other one coming from
the estimation of $p^q$ in the family of Rényi entropies. This connection is explicitly
established by the identity $\hat{H}_z = \hat{H}_1$. In addition, we proved that the statistical bias
of $\hat{H}_1$ is smaller than the bias of the likelihood estimator $\hat{H}_0$.

Furthermore, by numerical computation for various probability distributions, we
found that $\hat{H}_z$ (or the heuristic estimator $\hat{H}_z^*$) can be improved by the estimator $\hat{H}_2$,
which is an excellent member of the estimator family in [6], [7].

On the other hand, there is a uniform variance upper bound of $\hat{H}_z$ (and therefore
of $\hat{H}_1$) that decays at a rate of $O(\log(N)/N)$ for all distributions with finite entropy
[10]. It would be interesting to know if this variance bound also holds for the estimator
$\hat{H}_0$ or $\hat{H}_2$. The answer might be found in a forthcoming publication.

Figure 1. Statistical error of $\hat{H}_0$ (□), $\hat{H}_z$ (o), $\hat{H}_z^*$ (+) and $\hat{H}_2$ (•), for
the uniform probability distribution with M = 100 (see text). The
RMSE of $\hat{H}_2$ is significantly smaller than that of $\hat{H}_z$ and $\hat{H}_z^*$. The exact
value of the entropy is H = 5.3.



Figure 2. Same as in Fig. 1 but for Zipf's probability distribution
(see text). The exact value of the entropy is H = 3.68.

Figure 3. Statistical error of $\hat{H}_0$ (□), $\hat{H}_z$ (o), $\hat{H}_z^*$ (+) and $\hat{H}_2$ (•),
for sample size N = 10 in the instance of the uniform probability
distribution. Small sample estimation is expected when M is above
the sample size N.



Figure 4. Same as in Fig. 3 but for the Zipf distribution. There is
a crossover for $M \approx N$.




